The purpose of this page is to present the approach adopted to analyse the time series data for trucks captured by the SIWIM system, and then to build the best possible prediction model.
We continue our analyses using the dataset in which the missing values have been imputed:
##
Read 183928 rows and 39 (of 39) columns from 0.056 GB file in 00:00:03
| variable | classe | first_values |
|---|---|---|
| V1 | character | 1, 2, 3, 4, 5, 6 |
| Timestamp | character | 2017-11-23 00:00:42, 2017-11-23 00:21:31, 2017-11-23 00:21:32, 2017-11-23 00:24:14, 2017-11-23 00:24:53, 2017-11-23 00:30:35 |
| Site_ID | character | Normandie, Normandie, Normandie, Normandie, Normandie, Normandie |
| Warning_flags | character | 00000000, 00000000, 00000000, 00000000, 00000000, 00000000 |
| N | integer | 5, 2, 2, 5, 5, 5 |
| Subclass_ID | character | 113, 40, 20, 113, 113, 74 |
| Axle_groups | character | 113, 11, 11, 113, 113, 113 |
| A1 | double | 3.89637, 3.53769, 2.37186, 3.80111, 3.7235, 3.84 |
| A2 | double | 5.67876, NA, NA, 5.92265, 5.67742, 6 |
| A3 | double | 1.32642, NA, NA, 1.28177, 1.32719, 1.4 |
| A4 | double | 1.32642, NA, NA, 1.37017, 1.32719, 1.32 |
| A5 | double | NA, NA, NA, NA, NA, NA |
| A6 | double | NA, NA, NA, NA, NA, NA |
| A7 | double | NA, NA, NA, NA, NA, NA |
| Date | character | 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23 |
| Horaire | character | 00:00:42, 00:21:31, 00:21:32, 00:24:14, 00:24:53, 00:30:35 |
| Annee | character | 2017, 2017, 2017, 2017, 2017, 2017 |
| Mois_num | character | 11, 11, 11, 11, 11, 11 |
| Mois_annee | character | novembre, novembre, novembre, novembre, novembre, novembre |
| Jour_num | character | 22, 22, 22, 22, 22, 22 |
| Jour_semaine | character | mercredi, mercredi, mercredi, mercredi, mercredi, mercredi |
| Heure | character | 00, 00, 00, 00, 00, 00 |
| M1 | double | 4.68117227319062, 1.95471967380224, 0.887510703363914, 5.09427115188583, 5.06275229357798, 4.55823649337411 |
| M2 | double | 8.72759429153925, 2.87089704383282, 0.770931702344546, 8.10930682976555, 9.0244750254842, 8.61866462793068 |
| M3 | double | 4.24332313965342, NA, NA, 1.74763506625892, 4.65559633027523, 2.85606523955148 |
| M4 | double | 5.91610601427115, NA, NA, 3.56166156982671, 6.88427115188583, 4.34165137614679 |
| M5 | double | 5.82585117227319, NA, NA, 3.37384301732926, 6.48813455657492, 3.96273190621814 |
| M6 | double | NA, NA, NA, NA, NA, NA |
| M7 | double | NA, NA, NA, NA, NA, NA |
| M8 | double | NA, NA, NA, NA, NA, NA |
| Lane | character | droite, droite, droite, droite, droite, droite |
| Nb_axles | character | 5, 2, 2, 5, 5, 5 |
| Anomalie | character | N, N, N, N, N, N |
| total_axle_dist | double | 12.228, 3.53769, 2.37186, 12.3757, 12.0553, 12.56 |
| T | double | 12.9083, 12.8783, 12.8783, 12.91, 12.9167, 12.835 |
| Reduced_chi_squared | double | 391.589, 4.25249, 13.8461, 401.025, 521.619, 184.582 |
| Time_num | double | 0.905556905864198, 0.905597061471193, 0.905597093621399, 0.905602301954733, 0.905603555812757, 0.905614551183128 |
| Vitesse | double | 76.40208, 74.09844, 74.09844, 81.46728, 67.95216, 73.728 |
| MGV | double | 29.3940876656473, 4.82561671763507, 1.65844036697248, 21.886748216106, 32.1151885830785, 24.3373088685015 |
We perform a few conversions after importing the data:
#Format date & time
siwim_data[, Timestamp := as.POSIXct(substr(siwim_data$Timestamp,1,19), format = '%Y-%m-%d %H:%M:%S')]
siwim_data[, Date := as.Date(siwim_data$Date)]
# Refactor time data
cols <- c(
"Mois_annee",
"Jour_semaine",
"Heure",
"Axle_groups"
)
siwim_data[, (cols) := lapply(.SD, function(x) as.factor(x)), .SDcols=cols]
# Hour as numeric (note: Heure is a factor here, so as.numeric() returns
# the level index: hour "00" becomes 1, "01" becomes 2, etc.)
siwim_data[,Heure_num := as.numeric(Heure)]
# Drop columns that are no longer needed
maxN <- 8
axles_load <- paste("A", 1:(maxN-1), sep = "")
axles_mass <- paste("M", 1:maxN, sep = "")
useless_cols <- c(axles_load,axles_mass, "Site_ID")
siwim_data[, (useless_cols) := NULL]
# Structure after conversions
#str(siwim_data)
data.frame(variable = names(siwim_data),
classe = sapply(siwim_data, typeof),
first_values = sapply(siwim_data, function(x) paste0(head(x), collapse = ", ")),
row.names = NULL) %>%
kable("html") %>%
kable_styling()
| variable | classe | first_values |
|---|---|---|
| V1 | character | 1, 2, 3, 4, 5, 6 |
| Timestamp | double | 2017-11-23 00:00:42, 2017-11-23 00:21:31, 2017-11-23 00:21:32, 2017-11-23 00:24:14, 2017-11-23 00:24:53, 2017-11-23 00:30:35 |
| Warning_flags | character | 00000000, 00000000, 00000000, 00000000, 00000000, 00000000 |
| N | integer | 5, 2, 2, 5, 5, 5 |
| Subclass_ID | character | 113, 40, 20, 113, 113, 74 |
| Axle_groups | integer | 113, 11, 11, 113, 113, 113 |
| Date | double | 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23 |
| Horaire | character | 00:00:42, 00:21:31, 00:21:32, 00:24:14, 00:24:53, 00:30:35 |
| Annee | character | 2017, 2017, 2017, 2017, 2017, 2017 |
| Mois_num | character | 11, 11, 11, 11, 11, 11 |
| Mois_annee | integer | novembre, novembre, novembre, novembre, novembre, novembre |
| Jour_num | character | 22, 22, 22, 22, 22, 22 |
| Jour_semaine | integer | mercredi, mercredi, mercredi, mercredi, mercredi, mercredi |
| Heure | integer | 00, 00, 00, 00, 00, 00 |
| Lane | character | droite, droite, droite, droite, droite, droite |
| Nb_axles | character | 5, 2, 2, 5, 5, 5 |
| Anomalie | character | N, N, N, N, N, N |
| total_axle_dist | double | 12.228, 3.53769, 2.37186, 12.3757, 12.0553, 12.56 |
| T | double | 12.9083, 12.8783, 12.8783, 12.91, 12.9167, 12.835 |
| Reduced_chi_squared | double | 391.589, 4.25249, 13.8461, 401.025, 521.619, 184.582 |
| Time_num | double | 0.905556905864198, 0.905597061471193, 0.905597093621399, 0.905602301954733, 0.905603555812757, 0.905614551183128 |
| Vitesse | double | 76.40208, 74.09844, 74.09844, 81.46728, 67.95216, 73.728 |
| MGV | double | 29.3940876656473, 4.82561671763507, 1.65844036697248, 21.886748216106, 32.1151885830785, 24.3373088685015 |
| Heure_num | double | 1, 1, 1, 1, 1, 1 |
We have decided to forecast at an hourly horizon, so we aggregate the data by date and hour. In this operation, we count the number of trucks, sum their weights and axle spacings, and take the mean of the temperature and of the truck speeds.
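Schematically, this aggregation amounts to grouping the records by (date, hour) and reducing each group with a count, sums, and means. A minimal Python sketch with made-up truck records (the real work is done by the data.table call that follows; the structures here are only illustrative):

```python
from collections import defaultdict

# Made-up truck records: (date, hour, weight, axle_dist, temperature, speed)
rows = [
    ("2017-11-23", 0, 29.4, 12.2, 12.9, 76.4),
    ("2017-11-23", 0, 4.8, 3.5, 12.9, 74.1),
    ("2017-11-23", 1, 21.9, 12.4, 12.8, 81.5),
]

# Group by (date, hour), then reduce each group: count, sums, means
groups = defaultdict(list)
for date, hour, weight, dist, temp, speed in rows:
    groups[(date, hour)].append((weight, dist, temp, speed))

agg = {
    key: {
        "Count": len(vals),
        "Total_Weight": sum(v[0] for v in vals),
        "Total_axle_dist": sum(v[1] for v in vals),
        "T_mean": sum(v[2] for v in vals) / len(vals),
        "Vitesse_mean": sum(v[3] for v in vals) / len(vals),
    }
    for key, vals in groups.items()
}

print(agg[("2017-11-23", 0)]["Count"])                    # 2
print(round(agg[("2017-11-23", 0)]["Total_Weight"], 1))   # 34.2
```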
## Aggregate data by day and hour (chosen time window)
siwim_data_hours <- siwim_data[, .(Count=.N, Total_Weight=sum(MGV),
Total_axle_dist=sum(total_axle_dist),
T_mean=mean(T),
Vitesse_mean=mean(Vitesse)), by = .(Date, Heure)]
#str(siwim_data_hours)
data.frame(variable = names(siwim_data_hours),
classe = sapply(siwim_data_hours, typeof),
first_values = sapply(siwim_data_hours, function(x) paste0(head(x), collapse = ", ")),
row.names = NULL) %>%
kable("html") %>%
kable_styling()
| variable | classe | first_values |
|---|---|---|
| Date | double | 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23 |
| Heure | integer | 00, 01, 02, 03, 04, 05 |
| Count | integer | 14, 11, 9, 24, 29, 45 |
| Total_Weight | double | 275.833261977574, 292.930397553517, 283.591437308868, 501.946666666667, 604.64493374108, 916.674852191641 |
| Total_axle_dist | double | 140.26137, 118.71912, 106.3454, 262.82355, 298.64435, 503.139055830308 |
| T_mean | double | 12.8463, 12.7856, 12.7861222222222, 12.8314666666667, 13.0237965517241, 13.1572577777778 |
| Vitesse_mean | double | 75.21444, 72.8242690909091, 67.67436, 79.000275, 77.4667117241379, 71.239432 |
We then create a number of categorical and numeric time-related variables (year, month, week, day, day of the week).
siwim_data_hours[, ':=' (Annee = as.numeric(format(as.Date(Date), format = "%Y")),
Mois_annee = factor(format(as.Date(Date), format = "%B")),
Mois_num = as.numeric(format(as.Date(Date), format = "%m")),
Jour_num = as.numeric(format(as.Date(Date), format = "%d")),
Jour_sem_num = as.numeric(format(as.Date(Date), format = "%w")),
Semaine_num = as.numeric(format(as.Date(Date), format = "%V")),
Semaine_fac = factor(format(as.Date(Date), format = "%V")),
Heure_num = as.numeric(Heure)
)]
#Categorical variable for the day of the week
siwim_data_hours[, Jour_semaine := factor(weekdays(Date),
levels = c("lundi", "mardi", "mercredi",
"jeudi", "vendredi", "samedi",
"dimanche"))]
siwim_data_hours$Jour_semaine <- relevel(siwim_data_hours$Jour_semaine, ref = "lundi")
#Time step creation
siwim_data_hours[, time_step := as.POSIXct(paste(as.character(Date), paste(Heure, '00','00', sep = ':'), sep = " "))]
Here is a graphical representation of the data, with some indications of the distribution by month, day of the week, and hour.
#Box plot by month
#boxplot(Count ~ Mois_annee, data = siwim_data_hours)
ggplot(siwim_data_hours, aes(x=Mois_annee, y=Count, fill=Mois_annee)) +
geom_boxplot() + scale_x_discrete(limits=c("juillet", "août", "septembre", "octobre", "novembre"))
#Box plot by day of the week
#boxplot(Count ~ Jour_semaine, data = siwim_data_hours)
ggplot(siwim_data_hours, aes(x=Jour_semaine, y=Count, fill=Jour_semaine)) +
geom_boxplot() + scale_x_discrete(limits=c("lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche"))
#Box plot by hour
#boxplot(Count ~ Heure, data = siwim_data_hours)
ggplot(siwim_data_hours, aes(x=Heure, y=Count, fill=Heure)) +
geom_boxplot()
Several periods show missing data, most likely due to a failure of the SIWIM system. The idea is to reconstruct the count values by linear or spline interpolation. To do so, we first complete the time series with the missing dates and hours.
##Get full sequence
full_sequence <- seq(from=min(siwim_data_hours$time_step),
to=max(siwim_data_hours$time_step), by="h")
##Grab the missing sequence
missing_sequence <- full_sequence[!(full_sequence %in% siwim_data_hours$time_step)]
n_samples <- sample(1:length(missing_sequence), 30, replace = TRUE)
kable(missing_sequence[n_samples])
## Warning in kable_markdown(x = structure(c("2017-09-19 18:00:00",
## "2017-08-24 05:00:00", : The table should have a header (column names)
| 2017-09-19 18:00:00 |
| 2017-08-24 05:00:00 |
| 2017-09-15 04:00:00 |
| 2017-09-20 04:00:00 |
| 2017-10-06 16:00:00 |
| 2017-09-18 12:00:00 |
| 2017-09-17 08:00:00 |
| 2017-10-10 03:00:00 |
| 2017-09-15 09:00:00 |
| 2017-08-23 16:00:00 |
| 2017-07-19 21:00:00 |
| 2017-09-18 01:00:00 |
| 2017-09-12 15:00:00 |
| 2017-10-05 13:00:00 |
| 2017-10-13 20:00:00 |
| 2017-07-08 22:00:00 |
| 2017-08-26 00:00:00 |
| 2017-09-19 20:00:00 |
| 2017-10-10 09:00:00 |
| 2017-08-28 00:00:00 |
| 2017-08-26 13:00:00 |
| 2017-09-12 14:00:00 |
| 2017-09-19 18:00:00 |
| 2017-08-27 17:00:00 |
| 2017-10-06 11:00:00 |
| 2017-10-08 20:00:00 |
| 2017-08-27 06:00:00 |
| 2017-09-15 16:00:00 |
| 2017-09-20 05:00:00 |
| 2017-10-15 12:00:00 |
We build a new dataset containing only the missing dates and hours; the categorical and numeric time-related variables are derived at the same time. Finally, we merge the existing-data dataset with the missing-data one.
siwim_data_hours_missing <- data.table(time_step = missing_sequence,
Date = as.Date(strftime(missing_sequence, format = "%Y-%m-%d")),
Annee = as.numeric(strftime(missing_sequence, format = "%Y")),
Mois_annee = factor(strftime(missing_sequence, format = "%B")),
Mois_num = as.numeric(strftime(missing_sequence, format = "%m")),
Jour_num = as.numeric(strftime(missing_sequence, format = "%d")),
Jour_sem_num = as.numeric(strftime(missing_sequence, format = "%w")))
siwim_data_hours_missing[, Jour_semaine := factor(weekdays(Date))]
## Existing and missing data
siwim_data_hours_full <- rbindlist(list(siwim_data_hours, siwim_data_hours_missing), fill = T)
Here are the dates with missing data:
head(siwim_data_hours_full[siwim_data_hours_full$time_step %in% missing_sequence]) %>% kable()
| Date | Heure | Count | Total_Weight | Total_axle_dist | T_mean | Vitesse_mean | Annee | Mois_annee | Mois_num | Jour_num | Jour_sem_num | Semaine_num | Semaine_fac | Heure_num | Jour_semaine | time_step |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-07-08 | NA | NA | NA | NA | NA | NA | 2017 | juillet | 7 | 8 | 6 | NA | NA | NA | samedi | 2017-07-08 22:00:00 |
| 2017-07-09 | NA | NA | NA | NA | NA | NA | 2017 | juillet | 7 | 9 | 0 | NA | NA | NA | dimanche | 2017-07-09 05:00:00 |
| 2017-07-16 | NA | NA | NA | NA | NA | NA | 2017 | juillet | 7 | 16 | 0 | NA | NA | NA | dimanche | 2017-07-16 01:00:00 |
| 2017-07-16 | NA | NA | NA | NA | NA | NA | 2017 | juillet | 7 | 16 | 0 | NA | NA | NA | dimanche | 2017-07-16 02:00:00 |
| 2017-07-19 | NA | NA | NA | NA | NA | NA | 2017 | juillet | 7 | 19 | 3 | NA | NA | NA | mercredi | 2017-07-19 17:00:00 |
| 2017-07-19 | NA | NA | NA | NA | NA | NA | 2017 | juillet | 7 | 19 | 3 | NA | NA | NA | mercredi | 2017-07-19 18:00:00 |
We then perform both linear and spline interpolation on the completed dataset.
siwim_data_hours_count_full_2 <- na.interpolation(siwim_data_hours_full$Count)
siwim_data_hours_count_full_3 <- na.interpolation(siwim_data_hours_full$Count, option = "spline")
siwim_data_dt_full <- data.table(time_step = siwim_data_hours_full$time_step,
full_interp_lm = siwim_data_hours_count_full_2,
full_interp_spline = siwim_data_hours_count_full_3)
## Plot of the two interpolations
df <- melt(siwim_data_dt_full[, c("time_step", "full_interp_lm",
"full_interp_spline")], id="time_step")
ggplot(df) + geom_line(aes(x=time_step, y=value, color=variable)) + facet_wrap( ~ variable, scales="free")
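For reference, the linear variant fills each internal gap along the straight line between its nearest known neighbours. A minimal Python sketch of that idea (not the imputeTS implementation; `interpolate_linear` is a name made up here):

```python
def interpolate_linear(values):
    """Fill internal None gaps by linear interpolation between the
    nearest known neighbours; leading/trailing gaps are left as-is."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out

# Hourly counts with a two-hour outage between the values 10 and 40
print(interpolate_linear([10, None, None, 40, 50]))  # [10, 20.0, 30.0, 40, 50]
```

Spline interpolation fits a smooth piecewise polynomial through the known points instead of straight segments, which is why it tracks the daily oscillation of the counts better.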
Spline interpolation seems to fill in the time series better.
Before applying a classical time series model, we must check whether the series is stationary. A stationary time series must satisfy three criteria:

* its mean is constant over time
* its variance is independent of time
* the covariance between terms is independent of time.
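As a rough, informal illustration of the first criterion, one can compare the means of the two halves of a series: a trending series shows a clear gap, while a mean-stationary one does not. A minimal Python sketch on synthetic data (`half_stats` is a made-up helper, not a formal test):

```python
def half_stats(x):
    """Mean and variance of the first and second halves of a series."""
    h = len(x) // 2
    stats = []
    for part in (x[:h], x[h:]):
        m = sum(part) / len(part)
        v = sum((u - m) ** 2 for u in part) / len(part)
        stats.append((m, v))
    return stats

trend = [0.1 * t for t in range(100)]    # upward trend: mean drifts
noise = [(-1) ** t for t in range(100)]  # alternating +/-1: constant mean

(m1t, _), (m2t, _) = half_stats(trend)
(m1n, _), (m2n, _) = half_stats(noise)
print(m2t - m1t)   # clear gap between half-means -> not mean-stationary
print(m2n - m1n)   # ~0 -> mean constant over time
```

The Dickey-Fuller tests below formalise this intuition for the unit-root case.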
The Dickey-Fuller test takes as its null hypothesis \(H_0\) that a unit root is present in a first-order autoregressive model. The alternative hypothesis is that the time series is stationary.
\(y_t = \rho y_{t-1} + u_t\). A unit root is present if \(\rho = 1\).
This equation can be rewritten as the difference of \(y_t\) and \(y_{t-1}\): \(\nabla y_t = (\rho - 1) y_{t-1} + u_t = \delta y_{t-1} + u_t\)
The null hypothesis \(H_0\) then becomes \(\delta = 0\). Two other versions of the test exist:

* with a constant (drift): \(y_t = a_0 + \rho y_{t-1} + u_t\)
* with a deterministic time trend: \(y_t = a_0 + a_1 t + \rho y_{t-1} + u_t\)
The augmented Dickey-Fuller test adds lagged difference terms in order to remove structural effects due to autocorrelation of order greater than 1.
# test Dickey Fuller for stationarity detection
X <- siwim_data_hours$Count
lags <- 0
z <- diff(X)
n <- length(z)
z.diff <- embed(z, lags+1)[,1]
z.lag.1 <- X[(lags+1):n]
summary(lm(z.diff~0+z.lag.1 ))
##
## Call:
## lm(formula = z.diff ~ 0 + z.lag.1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -133.376 -8.007 0.321 9.376 163.289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 -0.032108 0.004815 -6.669 3.11e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.68 on 2726 degrees of freedom
## Multiple R-squared: 0.01605, Adjusted R-squared: 0.01569
## F-statistic: 44.47 on 1 and 2726 DF, p-value: 3.113e-11
#Augmented Dickey Fuller
lags <- 1
z <- diff(X)
n <- length(z)
z.diff <- embed(z, lags+1)[,1]
z.lag.1 <- X[(lags+1):n]
k <- lags + 1
z.diff.lag <- embed(z, lags+1)[, 2:k]
summary(lm(z.diff~0+z.lag.1+z.diff.lag ))
##
## Call:
## lm(formula = z.diff ~ 0 + z.lag.1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.027 -6.670 1.128 9.827 165.632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 -0.041184 0.004658 -8.842 <2e-16 ***
## z.diff.lag 0.282663 0.018379 15.380 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.73 on 2724 degrees of freedom
## Multiple R-squared: 0.09467, Adjusted R-squared: 0.094
## F-statistic: 142.4 on 2 and 2724 DF, p-value: < 2.2e-16
#Augmented Dickey Fuller with drift
summary(lm(z.diff~1+z.lag.1+z.diff.lag ))
##
## Call:
## lm(formula = z.diff ~ 1 + z.lag.1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -135.688 -10.127 -3.242 6.763 160.407
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.807625 0.619152 9.38 <2e-16 ***
## z.lag.1 -0.086047 0.006626 -12.99 <2e-16 ***
## z.diff.lag 0.305085 0.018249 16.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.37 on 2723 degrees of freedom
## Multiple R-squared: 0.123, Adjusted R-squared: 0.1224
## F-statistic: 191 on 2 and 2723 DF, p-value: < 2.2e-16
time <- 1:length(z.diff)
#Augmented Dickey Fuller with drift and time trend
summary(lm(z.diff~1+time+z.lag.1+z.diff.lag ))
##
## Call:
## lm(formula = z.diff ~ 1 + time + z.lag.1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -135.650 -10.169 -3.222 6.654 160.380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0430908 0.9722388 6.216 5.9e-10 ***
## time -0.0001711 0.0005446 -0.314 0.753
## z.lag.1 -0.0860790 0.0066274 -12.988 < 2e-16 ***
## z.diff.lag 0.3050851 0.0182524 16.715 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.37 on 2722 degrees of freedom
## Multiple R-squared: 0.123, Adjusted R-squared: 0.1221
## F-statistic: 127.3 on 3 and 2722 DF, p-value: < 2.2e-16
# With the tseries functions
#Simple Dickey-Fuller test
adf.test(siwim_data_hours$Count, k = 0)
## Warning in adf.test(siwim_data_hours$Count, k = 0): p-value smaller than
## printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: siwim_data_hours$Count
## Dickey-Fuller = -9.6405, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
#Augmented Dickey-fuller test
adf.test(siwim_data_hours$Count, k = 168)
##
## Augmented Dickey-Fuller Test
##
## data: siwim_data_hours$Count
## Dickey-Fuller = -3.6732, Lag order = 168, p-value = 0.02561
## alternative hypothesis: stationary
#Number of differences required for a stationary series
ndiffs(X)
## [1] 0
The autocorrelation function (ACF) measures the correlation between the time series and a copy of itself shifted in time (lags t-1, t-2, etc.). For example, at lag 5, the ACF compares the series at time t with the series at time t-5.
For a moving-average series of order n, there is no correlation between \(x(t)\) and \(x(t-n-1)\). The autocorrelation plot therefore drops below the significance threshold after the \(n^{th}\) lag, which is how the appropriate order of an MA (moving average) series is found.
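What acf reports at each lag can be written out by hand: the lag-k sample autocorrelation is the covariance between the series and its shifted copy, normalised by the overall variance. A minimal Python sketch on a toy periodic series (`acf_lag` is a name made up here):

```python
def acf_lag(x, k):
    """Sample autocorrelation at lag k: covariance between the series
    and itself shifted by k, normalised by the overall variance."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

x = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2]   # toy periodic "traffic" series
print(acf_lag(x, 0))              # 1.0 by construction
print(round(acf_lag(x, 1), 3))    # 0.386
```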
acf(X)
The partial autocorrelation function (PACF) measures the correlation between a time series and a copy of itself shifted in time, after removing the variation already explained by the intermediate lags. For example, at lag 5, it measures the correlation with time t while removing the effects already explained by lags 1 to 4.
As with the ACF plot, the PACF drops below the significance threshold after a certain lag, which gives the order of an AR (autoregressive) series. For example, for an AR series of order 1, once the effect of lag 1 (\(x(t-1)\)) is removed, lag 2 (\(x(t-2)\)) is independent of \(x(t)\); the partial autocorrelation therefore falls off sharply after lag 1.
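One way to see what the PACF removes: at lag 2, regress both \(x(t)\) and \(x(t-2)\) on the intermediate lag \(x(t-1)\) and correlate the residuals. A minimal Python sketch (pacf itself uses an equivalent recursive algorithm; the helper names here are made up):

```python
def ols_residuals(y, x):
    """Residuals of the simple regression y = a + b*x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def corr(u, v):
    """Pearson correlation between two vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u)
           * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

def pacf_lag2(x):
    """Lag-2 partial autocorrelation: correlate the parts of x(t) and
    x(t-2) left unexplained by the intermediate lag x(t-1)."""
    y, mid, far = x[2:], x[1:-1], x[:-2]
    return corr(ols_residuals(y, mid), ols_residuals(far, mid))

x = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2]   # toy periodic series
print(pacf_lag2(x))
```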
pacf(X)
The ACF plot suggests an MA order of 8, and the PACF plot suggests an AR order of 1.
* Number of AR (auto-regressive) terms (p): AR terms are lags of the dependent variable. For instance, if p is 5, the predictors for x(t) are x(t-1)…x(t-5).
* Number of MA (moving average) terms (q): MA terms are lagged forecast errors in the prediction equation. For instance, if q is 5, the predictors for x(t) are e(t-1)…e(t-5), where e(i) is the difference between the moving average at the i-th instant and the actual value.
* Number of differences (d): the number of non-seasonal differences. Here we took the first-order difference, so we can either pass the differenced variable with d = 0, or the original variable with d = 1; both give the same results.
* q: the lag at which the ACF chart first crosses the upper confidence bound.
* p: the lag at which the PACF chart first crosses the upper confidence bound.

Here, \(p = 1\) and \(q = 8\).
# AR model
# forecast window
f_win <- 168
x3 <- xts(siwim_data_hours$Count, order.by = siwim_data_hours$time_step)
attr(x3, 'frequency') <- 24
yt<-x3[1:(length(x3)-f_win)]
model_AR <- arima(yt, order = c(1,0,0))
pd_AR <- funggcast(x3, forecast(model_AR, h = f_win))
p1a<-ggplot(data=pd_AR,aes(x=date,y=observed))
p1a<-p1a+geom_line(col='red')
p1a<-p1a+geom_line(aes(y=fitted),col='blue')
p1a<-p1a+geom_line(aes(y=forecast))+geom_ribbon(aes(ymin=lo95,ymax=hi95),alpha=.25)
p1a<-p1a+scale_x_datetime(name='',breaks=date_breaks('8 hours'),minor_breaks= date_breaks('1 hour'),labels=date_format("%Y-%m-%d %H"),expand=c(0,0))
p1a<-p1a+scale_y_continuous(name='Units of Y')
p1a <- p1a + labs(title='ARIMA fit to truck counts\n (black=forecast, blue=fitted, red=data, shadow=95% conf. interval)')
p1a<-p1a+theme(axis.text.x=element_text(size=10))
#p1a
dygraph(xts(x=pd_AR[,-c(1,5,6)], order.by = pd_AR[,1]), "Frequence des camions - ARIMA AR") %>%
dySeries("observed", label = "Real") %>%
dySeries("fitted", label = "Fitted") %>%
dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
dyRangeSelector()
#plot(forecast(model_AR, 168))
plot(model_AR$residuals)
acf(model_AR$residuals)
pacf(model_AR$residuals)
# MA model
model_MA <- arima(yt, order = c(0,0,8))
accuracy(model_MA)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01290228 22.13703 15.36456 -67.1872 82.54818 0.9898522
## ACF1
## Training set 0.01013776
pd_MA <- funggcast(x3, forecast(model_MA, h = f_win))
#plot(forecast(model_MA, 168))
dygraph(xts(x=pd_MA[,-c(1,5,6)], order.by = pd_MA[,1]), "Frequence des camions - ARIMA MA") %>%
dySeries("observed", label = "Real") %>%
dySeries("fitted", label = "Fitted") %>%
dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
dyRangeSelector()
plot(model_MA$residuals)
acf(model_MA$residuals)
pacf(model_MA$residuals)
# Composite model
model_ARMA <- arima(yt, order = c(1,0,8))
accuracy(model_ARMA)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.03028622 21.43891 14.38407 -35.04558 59.21652 0.9266842
## ACF1
## Training set 0.02237713
#autoplot(forecast(model_ARMA, 168))
pd_ARMA <- funggcast(x3, forecast(model_ARMA, h = f_win))
dygraph(xts(x=pd_ARMA[,-c(1,5,6)], order.by = pd_ARMA[,1]), "Frequence des camions - ARIMA ARMA") %>%
dySeries("observed", label = "Real") %>%
dySeries("fitted", label = "Fitted") %>%
dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
dyRangeSelector()
autoplot(model_ARMA$residuals)
autoplot(acf(model_ARMA$residuals))
autoplot(pacf(model_ARMA$residuals))
# Auto ARIMA model
model_auto_ARMA <- auto.arima(yt)
accuracy(model_auto_ARMA)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02867587 20.37799 13.58248 -44.71983 64.20429 0.391899
## ACF1
## Training set 0.0004519754
#autoplot(forecast(model_auto_ARMA, 168))
pd_auto_ARIMA <- funggcast(x3, forecast(model_auto_ARMA, h = f_win))
dygraph(xts(x=pd_auto_ARIMA[,-c(1,5,6)], order.by = pd_auto_ARIMA[,1]), "Frequence des camions - ARIMA auto") %>%
dySeries("observed", label = "Real") %>%
dySeries("fitted", label = "Fitted") %>%
dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
dyRangeSelector()
autoplot(model_auto_ARMA$residuals)
autoplot(acf(model_auto_ARMA$residuals))
autoplot(pacf(model_auto_ARMA$residuals))
## auto arima with season
period <- 24 #first season
model_auto_ARMA_seas <- auto.arima(ts(data=yt,frequency = period))
accuracy(model_auto_ARMA_seas)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02867587 20.37799 13.58248 -44.71983 64.20429 0.391899
## ACF1
## Training set 0.0004519754
#summary(model_auto_ARMA_seas)
#autoplot(forecast(model_auto_ARMA_seas, 168))
pd_auto_ARIMA_seas <- funggcast(x3, forecast(model_auto_ARMA_seas, h = f_win))
dygraph(xts(x=pd_auto_ARIMA_seas[,-c(1,5,6)], order.by = pd_auto_ARIMA_seas[,1]), "Frequence des camions - ARIMA saison") %>%
dySeries("observed", label = "Real") %>%
dySeries("fitted", label = "Fitted") %>%
dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
dyRangeSelector()
autoplot(model_auto_ARMA_seas$residuals)
autoplot(acf(model_auto_ARMA_seas$residuals))
autoplot(pacf(model_auto_ARMA_seas$residuals))
## auto arima with double seasons
# Double-season series
x6 <- lag(lag(yt,k = 24), k =7)
model_auto_ARMA_double <- auto.arima(x6)
accuracy(model_auto_ARMA_double)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.03641216 20.43041 13.51384 -41.33014 60.19032 0.3870543
## ACF1
## Training set -0.0007796341
#autoplot(forecast(model_auto_ARMA_double, 168))
pd_auto_ARIMA_double_seas <- funggcast(x3, forecast(model_auto_ARMA_double, h = f_win))
dygraph(xts(x=pd_auto_ARIMA_double_seas[,-c(1,5,6)], order.by = pd_auto_ARIMA_double_seas[,1]), "Frequence des camions - ARIMA double saison") %>%
dySeries("observed", label = "Real") %>%
dySeries("fitted", label = "Fitted") %>%
dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
dyRangeSelector()
plot(model_auto_ARMA_double$residuals)
#acf(model_auto_ARMA_double$residuals)
#pacf(model_auto_ARMA_double$residuals)
rbind.data.frame("Modèle AR" = accuracy(model_AR),
"Modèle MA" = accuracy(model_MA),
"Modèle ARMA" = accuracy(model_ARMA),
"Modèle auto ARIMA" = accuracy(model_auto_ARMA),
"Modèle ARIMA avec saison" = accuracy(model_auto_ARMA_seas),
"Modèle ARIMA avec double saison" = accuracy(model_auto_ARMA_double)) %>% kable()
| ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | |
|---|---|---|---|---|---|---|---|
| Modèle AR | 0.0551125 | 23.44477 | 15.64161 | -42.58471 | 59.65201 | 1.0077004 | 0.2510202 |
| Modèle MA | 0.0129023 | 22.13703 | 15.36456 | -67.18720 | 82.54818 | 0.9898522 | 0.0101378 |
| Modèle ARMA | 0.0302862 | 21.43891 | 14.38407 | -35.04558 | 59.21652 | 0.9266842 | 0.0223771 |
| Modèle auto ARIMA | 0.0286759 | 20.37799 | 13.58248 | -44.71983 | 64.20429 | 0.3918990 | 0.0004520 |
| Modèle ARIMA avec saison | 0.0286759 | 20.37799 | 13.58248 | -44.71983 | 64.20429 | 0.3918990 | 0.0004520 |
| Modèle ARIMA avec double saison | 0.0364122 | 20.43041 | 13.51384 | -41.33014 | 60.19032 | 0.3870543 | -0.0007796 |
ts_Y <- as.ts(x3)
dekom <- stl(ts_Y, s.window = "periodic", robust = TRUE)
autoplot(dekom)
#Simple exponential smoothing
siwim_data_hw_ses <- HoltWinters(yt ,beta = FALSE,
gamma = FALSE)
#plot(siwim_data_hw_ses)
pd_hw_ses <- funggcast(x3, forecast(siwim_data_hw_ses, h = f_win))
dygraph(xts(x=pd_hw_ses[,-c(1,5,6)], order.by = pd_hw_ses[,1]), "Frequence des camions - Simple Exponential Smoothing") %>%
dySeries("observed", label = "Real") %>%
dySeries("fitted", label = "Fitted") %>%
dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
dyRangeSelector()
#Exponential smoothing
siwim_data_hw_es <- HoltWinters(yt, gamma = FALSE)
#plot(siwim_data_hw_es)
pd_hw_es <- funggcast(x3, forecast(siwim_data_hw_es, h = f_win))
dygraph(xts(x=pd_hw_es[,-c(1,5,6)], order.by = pd_hw_es[,1]), "Frequence des camions - Exponential Smoothing") %>%
dySeries("observed", label = "Real") %>%
dySeries("fitted", label = "Fitted") %>%
dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
dyRangeSelector()
rbind.data.frame("Simple Exponential Smoothing" = accuracy(forecast(siwim_data_hw_ses, h = f_win)), "Exponential Smoothing" =
accuracy(forecast(siwim_data_hw_es, h = f_win))) %>% kable()
| ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | |
|---|---|---|---|---|---|---|---|
| Simple Exponential Smoothing | 0.0660460 | 23.83423 | 15.52235 | -13.43676 | 40.45712 | 0.4478706 | 0.2272056 |
| Exponential Smoothing | -0.4874021 | 23.86568 | 15.59277 | -17.08559 | 42.09580 | 0.4499023 | 0.2274684 |
Test for autocorrelation of order greater than 1 in the residuals:
#Ljung-Box test on the forecast residuals
Box.test(forecast(siwim_data_hw_es, h = f_win)$residuals, lag = f_win, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: forecast(siwim_data_hw_es, h = f_win)$residuals
## X-squared = 8448, df = 168, p-value < 2.2e-16
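The statistic that Box.test computes can be sketched by hand: the Ljung-Box Q aggregates the squared residual autocorrelations over the first h lags and is compared against a \(\chi^2_h\) distribution. A minimal Python sketch on artificial, strongly autocorrelated "residuals" (`ljung_box` is a name made up here, not the stats::Box.test code):

```python
def ljung_box(x, h):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} r_k^2 / (n-k),
    where r_k is the lag-k sample autocorrelation of x.
    Large Q => the series is autocorrelated."""
    n = len(x)
    mean = sum(x) / n
    den = sum((v - mean) ** 2 for v in x)
    q = 0.0
    for k in range(1, h + 1):
        r_k = sum((x[t] - mean) * (x[t + k] - mean)
                  for t in range(n - k)) / den
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q

resid = [1, -1] * 20   # alternating residuals: strong lag-1 autocorrelation
print(round(ljung_box(resid, 3), 1))   # 119.7, far above the chi-squared(3) critical value
```

The huge X-squared value above (8448 for 168 degrees of freedom) tells the same story for our residuals: they are still strongly autocorrelated.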
x2 <- msts(siwim_data_hours$Count, seasonal.periods = c(24, 24*7), start = 2017 + 07/12 + 5/365 + 23/8760)
# double_hw <- dshw(x2,24, 24*7)
# frequency(x2)
plot(x2)
# dygraph(x2, "Frequence des camions - Double saisonnalité") %>%
# dyRangeSelector()
seasonaldecomp <- tbats(x2)
plot(seasonaldecomp)
accuracy(seasonaldecomp)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.720954 21.69751 13.8904 -20.31345 44.77766 0.886974
## ACF1
## Training set -0.01100627
plot(forecast(seasonaldecomp, 168 *4,level = 0.5))
seasonaldecomp2 <- tbats(siwim_data_hours$Count)
plot(seasonaldecomp2)
accuracy(seasonaldecomp2)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.5908902 22.7422 14.54008 -18.59127 43.58727 0.9284597
## ACF1
## Training set 0.06288263
plot(forecast(seasonaldecomp2, 168 *4))
#First simple linear model
simple_model_freq <- lm(Count ~ 0 + Heure + Jour_semaine, data = siwim_data_hours)
summary(simple_model_freq)
##
## Call:
## lm(formula = Count ~ 0 + Heure + Jour_semaine, data = siwim_data_hours)
##
## Residuals:
## Min 1Q Median 3Q Max
## -133.910 -20.320 -4.168 24.816 149.288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Heure00 25.770 3.475 7.416 1.61e-13 ***
## Heure01 24.277 3.453 7.031 2.59e-12 ***
## Heure02 27.132 3.453 7.858 5.59e-15 ***
## Heure03 32.208 3.510 9.175 < 2e-16 ***
## Heure04 38.442 3.442 11.170 < 2e-16 ***
## Heure05 53.700 3.548 15.134 < 2e-16 ***
## Heure06 86.848 3.464 25.073 < 2e-16 ***
## Heure07 129.342 3.453 37.462 < 2e-16 ***
## Heure08 128.230 3.442 37.259 < 2e-16 ***
## Heure09 132.535 3.442 38.509 < 2e-16 ***
## Heure10 135.792 3.453 39.331 < 2e-16 ***
## Heure11 127.555 3.431 37.179 < 2e-16 ***
## Heure12 126.727 3.431 36.938 < 2e-16 ***
## Heure13 130.038 3.420 38.021 < 2e-16 ***
## Heure14 142.964 3.431 41.671 < 2e-16 ***
## Heure15 144.025 3.431 41.980 < 2e-16 ***
## Heure16 131.970 3.442 38.345 < 2e-16 ***
## Heure17 113.281 3.442 32.915 < 2e-16 ***
## Heure18 88.907 3.442 25.833 < 2e-16 ***
## Heure19 66.185 3.453 19.169 < 2e-16 ***
## Heure20 48.580 3.453 14.070 < 2e-16 ***
## Heure21 38.194 3.453 11.062 < 2e-16 ***
## Heure22 30.222 3.510 8.609 < 2e-16 ***
## Heure23 26.888 3.499 7.685 2.12e-14 ***
## Jour_semainemardi 3.946 2.356 1.674 0.094170 .
## Jour_semainemercredi 6.431 2.311 2.782 0.005433 **
## Jour_semainejeudi 8.835 2.339 3.777 0.000162 ***
## Jour_semainevendredi -6.438 2.374 -2.711 0.006747 **
## Jour_semainesamedi -71.354 2.389 -29.867 < 2e-16 ***
## Jour_semainedimanche -77.516 2.430 -31.905 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.9 on 2698 degrees of freedom
## Multiple R-squared: 0.8793, Adjusted R-squared: 0.8779
## F-statistic: 655.1 on 30 and 2698 DF, p-value: < 2.2e-16
plot(
simple_model_freq$coefficients[1:24], type = "h", xlab = "Heure", ylab = "coefficient")
# simple_model_freq$residuals
#Regression diagnostics
#plot(siwim_data_hours$Heure,rstudent(simple_model_freq))
#Influence measures for the model, including hat values (hii)
mes <- influence.measures(simple_model_freq)
#Plot of the hii values
plot(sort(mes$infmat[,"hat"]), type ="h")
#Predictions from the simple model
pred <- predict(simple_model_freq, cbind.data.frame(Heure = siwim_data_hours$Heure, Mois_annee = siwim_data_hours$Mois_annee, Jour_semaine = siwim_data_hours$Jour_semaine))
# plot(pred, type = "l", col = "red")
# lines(siwim_data_hours$Count, col = "green")
### Comparison graph of predicted vs real values
datas <- rbindlist(list(siwim_data_hours[, .(Count, time_step)],
data.table(Count = simple_model_freq$fitted.values,
time_step = siwim_data_hours[, time_step])))
datas[, type := rep(c("Real", "Fitted"), each = nrow(siwim_data_hours))]
ggplot(data = datas, aes(time_step, Count, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Frequency",
title = "Fit from simple MLR")
## Visualisation of the residuals
ggplot(data = data.table(Fitted_values = simple_model_freq$fitted.values,
Residuals = simple_model_freq$residuals),
aes(Fitted_values, Residuals)) +
geom_point(size = 1.7) +
geom_smooth() +
geom_hline(yintercept = 0, color = "red", size = 1) +
labs(title = "Fitted values vs Residuals")
## `geom_smooth()` using method = 'gam'
## Function to visualise normality of the residuals
ggQQ <- function(lm) {
  # extract standardized residuals from the fit
  d <- data.frame(std.resid = rstandard(lm))
  # calculate the first/third-quartile reference line
  y <- quantile(d$std.resid[!is.na(d$std.resid)], c(0.25, 0.75))
  x <- qnorm(c(0.25, 0.75))
  slope <- diff(y) / diff(x)
  int <- y[1L] - slope * x[1L]
  p <- ggplot(data = d, aes(sample = std.resid)) +
    stat_qq(shape = 1, size = 3) +       # open circles
    labs(title = "Normal Q-Q",           # plot title
         x = "Theoretical Quantiles",    # x-axis label
         y = "Standardized Residuals") + # y-axis label
    geom_abline(slope = slope, intercept = int, linetype = "dashed",
                size = 1, col = "firebrick1")  # dashed reference line
  return(p)
}
ggQQ(simple_model_freq)
################# Linear models with interactions #####################################
model_interact_freq <- lm(Count ~ 0 + Heure + Jour_semaine + Heure:Jour_semaine,
data = siwim_data_hours)
summary(model_interact_freq)
##
## Call:
## lm(formula = Count ~ 0 + Heure + Jour_semaine + Heure:Jour_semaine,
## data = siwim_data_hours)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.667 -3.891 0.250 6.250 122.118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Heure00 5.1875 5.1800 1.001 0.316704
## Heure01 7.9375 5.1800 1.532 0.125562
## Heure02 9.0000 5.1800 1.737 0.082427 .
## Heure03 13.1250 5.1800 2.534 0.011343 *
## Heure04 17.0625 5.1800 3.294 0.001001 **
## Heure05 41.6250 5.1800 8.036 1.41e-15 ***
## Heure06 71.0000 5.1800 13.707 < 2e-16 ***
## Heure07 127.1250 5.1800 24.542 < 2e-16 ***
## Heure08 140.7500 5.1800 27.172 < 2e-16 ***
## Heure09 150.1250 5.1800 28.982 < 2e-16 ***
## Heure10 158.5625 5.1800 30.611 < 2e-16 ***
## Heure11 150.2500 5.1800 29.006 < 2e-16 ***
## Heure12 148.6875 5.1800 28.704 < 2e-16 ***
## Heure13 150.2500 5.1800 29.006 < 2e-16 ***
## Heure14 169.1250 5.1800 32.650 < 2e-16 ***
## Heure15 165.3750 5.1800 31.926 < 2e-16 ***
## Heure16 146.9375 5.1800 28.366 < 2e-16 ***
## Heure17 113.8125 5.1800 21.972 < 2e-16 ***
## Heure18 86.2500 5.1800 16.651 < 2e-16 ***
## Heure19 61.3750 5.1800 11.848 < 2e-16 ***
## Heure20 37.3125 5.1800 7.203 7.70e-13 ***
## Heure21 27.8750 5.1800 5.381 8.07e-08 ***
## Heure22 17.3125 5.1800 3.342 0.000843 ***
## Heure23 13.7500 5.1800 2.654 0.007993 **
## Jour_semainemardi 4.5625 7.3256 0.623 0.533461
## Jour_semainemercredi 4.7569 7.1192 0.668 0.504076
## Jour_semainejeudi 6.7537 7.2171 0.936 0.349469
## Jour_semainevendredi 6.5625 7.3256 0.896 0.370428
## Jour_semainesamedi 2.5625 7.3256 0.350 0.726517
## Jour_semainedimanche -2.8798 7.7367 -0.372 0.709755
## Heure01:Jour_semainemardi -3.4412 10.2835 -0.335 0.737931
## Heure02:Jour_semainemardi -1.0625 10.3600 -0.103 0.918322
## Heure03:Jour_semainemardi -0.8125 10.3600 -0.078 0.937495
## Heure04:Jour_semainemardi 2.5625 10.3600 0.247 0.804660
## Heure05:Jour_semainemardi -1.4375 10.3600 -0.139 0.889655
## Heure06:Jour_semainemardi 15.6250 10.3600 1.508 0.131625
## Heure07:Jour_semainemardi 18.8750 10.3600 1.822 0.068584 .
## Heure08:Jour_semainemardi -1.4375 10.3600 -0.139 0.889655
## Heure09:Jour_semainemardi -10.6875 10.3600 -1.032 0.302350
## Heure10:Jour_semainemardi -8.5000 10.3600 -0.820 0.412027
## Heure11:Jour_semainemardi -11.8125 10.3600 -1.140 0.254307
## Heure12:Jour_semainemardi -11.1875 10.3600 -1.080 0.280299
## Heure13:Jour_semainemardi -8.9301 10.2835 -0.868 0.385260
## Heure14:Jour_semainemardi -8.7463 10.2835 -0.851 0.395116
## Heure15:Jour_semainemardi -2.9375 10.2835 -0.286 0.775168
## Heure16:Jour_semainemardi 7.9706 10.2835 0.775 0.438362
## Heure17:Jour_semainemardi 22.1544 10.2835 2.154 0.031305 *
## Heure18:Jour_semainemardi 9.2463 10.2835 0.899 0.368663
## Heure19:Jour_semainemardi 1.3566 10.2835 0.132 0.895057
## Heure20:Jour_semainemardi -2.6397 10.2835 -0.257 0.797436
## Heure21:Jour_semainemardi -9.2022 10.2835 -0.895 0.370951
## Heure22:Jour_semainemardi -3.5809 10.2835 -0.348 0.727707
## Heure23:Jour_semainemardi -7.0184 10.2835 -0.682 0.494992
## Heure01:Jour_semainemercredi -2.3787 10.0056 -0.238 0.812107
## Heure02:Jour_semainemercredi -0.8096 10.0056 -0.081 0.935518
## Heure03:Jour_semainemercredi 2.4122 10.1375 0.238 0.811943
## Heure04:Jour_semainemercredi 2.6250 10.0681 0.261 0.794326
## Heure05:Jour_semainemercredi 3.3958 10.0681 0.337 0.735928
## Heure06:Jour_semainemercredi 17.1319 10.0681 1.702 0.088950 .
## Heure07:Jour_semainemercredi 16.0069 10.0681 1.590 0.111988
## Heure08:Jour_semainemercredi -1.0625 10.0681 -0.106 0.915962
## Heure09:Jour_semainemercredi -0.4931 10.0681 -0.049 0.960945
## Heure10:Jour_semainemercredi -13.3750 10.0681 -1.328 0.184146
## Heure11:Jour_semainemercredi -13.2175 10.0056 -1.321 0.186615
## Heure12:Jour_semainemercredi -8.0234 10.0056 -0.802 0.422689
## Heure13:Jour_semainemercredi -7.6385 10.0056 -0.763 0.445278
## Heure14:Jour_semainemercredi -0.3819 10.0681 -0.038 0.969742
## Heure15:Jour_semainemercredi 9.5347 10.0681 0.947 0.343716
## Heure16:Jour_semainemercredi 7.4167 10.0681 0.737 0.461402
## Heure17:Jour_semainemercredi 28.3129 10.1375 2.793 0.005263 **
## Heure18:Jour_semainemercredi 11.5813 10.1375 1.142 0.253387
## Heure19:Jour_semainemercredi 5.2210 10.1375 0.515 0.606586
## Heure20:Jour_semainemercredi -1.8930 10.1375 -0.187 0.851888
## Heure21:Jour_semainemercredi -8.0437 10.1375 -0.793 0.427584
## Heure22:Jour_semainemercredi -4.8342 10.1375 -0.477 0.633505
## Heure23:Jour_semainemercredi -3.9775 10.1375 -0.392 0.694827
## Heure01:Jour_semainejeudi -6.4134 10.1375 -0.633 0.527025
## Heure02:Jour_semainejeudi -4.0315 10.1375 -0.398 0.690902
## Heure03:Jour_semainejeudi 0.4743 10.2065 0.046 0.962942
## Heure04:Jour_semainejeudi 2.7132 10.2065 0.266 0.790388
## Heure05:Jour_semainejeudi 0.3272 10.2065 0.032 0.974428
## Heure06:Jour_semainejeudi 20.6581 10.2065 2.024 0.043072 *
## Heure07:Jour_semainejeudi 19.5102 10.1375 1.925 0.054396 .
## Heure08:Jour_semainejeudi 3.2741 10.1375 0.323 0.746745
## Heure09:Jour_semainejeudi -8.2676 10.1375 -0.816 0.414839
## Heure10:Jour_semainejeudi -3.3750 10.2065 -0.331 0.740919
## Heure11:Jour_semainejeudi -11.8860 10.2065 -1.165 0.244307
## Heure12:Jour_semainejeudi -10.5588 10.2065 -1.035 0.300990
## Heure13:Jour_semainejeudi -4.1801 10.2065 -0.410 0.682165
## Heure14:Jour_semainejeudi 0.1213 10.2065 0.012 0.990517
## Heure15:Jour_semainejeudi 9.0478 10.2065 0.886 0.375445
## Heure16:Jour_semainejeudi 19.4265 10.2065 1.903 0.057107 .
## Heure17:Jour_semainejeudi 17.7279 10.2065 1.737 0.082520 .
## Heure18:Jour_semainejeudi 12.9375 10.2065 1.268 0.205065
## Heure19:Jour_semainejeudi 2.4338 10.2835 0.237 0.812930
## Heure20:Jour_semainejeudi 3.6838 10.2835 0.358 0.720204
## Heure21:Jour_semainejeudi -3.8787 10.2835 -0.377 0.706075
## Heure22:Jour_semainejeudi -4.3787 10.2835 -0.426 0.670293
## Heure23:Jour_semainejeudi -9.0037 10.2835 -0.876 0.381360
## Heure01:Jour_semainevendredi -4.0000 10.3600 -0.386 0.699454
## Heure02:Jour_semainevendredi -1.6875 10.3600 -0.163 0.870621
## Heure03:Jour_semainevendredi -0.1250 10.3600 -0.012 0.990374
## Heure04:Jour_semainevendredi 4.1250 10.3600 0.398 0.690540
## Heure05:Jour_semainevendredi 5.5625 10.3600 0.537 0.591368
## Heure06:Jour_semainevendredi 18.9375 10.3600 1.828 0.067674 .
## Heure07:Jour_semainevendredi 21.8125 10.3600 2.105 0.035349 *
## Heure08:Jour_semainevendredi 3.0000 10.3600 0.290 0.772164
## Heure09:Jour_semainevendredi -1.3750 10.3600 -0.133 0.894423
## Heure10:Jour_semainevendredi -18.6875 10.3600 -1.804 0.071377 .
## Heure11:Jour_semainevendredi -19.2500 10.3600 -1.858 0.063268 .
## Heure12:Jour_semainevendredi -24.3750 10.3600 -2.353 0.018708 *
## Heure13:Jour_semainevendredi -17.0625 10.3600 -1.647 0.099688 .
## Heure14:Jour_semainevendredi -35.1250 10.3600 -3.390 0.000708 ***
## Heure15:Jour_semainevendredi -31.8750 10.3600 -3.077 0.002115 **
## Heure16:Jour_semainevendredi -43.8125 10.3600 -4.229 2.43e-05 ***
## Heure17:Jour_semainevendredi -40.2500 10.3600 -3.885 0.000105 ***
## Heure18:Jour_semainevendredi -34.6875 10.3600 -3.348 0.000825 ***
## Heure19:Jour_semainevendredi -30.5000 10.3600 -2.944 0.003269 **
## Heure20:Jour_semainevendredi -20.1250 10.3600 -1.943 0.052177 .
## Heure21:Jour_semainevendredi -18.3125 10.3600 -1.768 0.077244 .
## Heure22:Jour_semainevendredi -12.5000 10.3600 -1.207 0.227711
## Heure23:Jour_semainevendredi -11.6875 10.3600 -1.128 0.259367
## Heure01:Jour_semainesamedi -2.8750 10.3600 -0.278 0.781411
## Heure02:Jour_semainesamedi -2.6250 10.3600 -0.253 0.799996
## Heure03:Jour_semainesamedi -1.0000 10.3600 -0.097 0.923111
## Heure04:Jour_semainesamedi -4.4375 10.3600 -0.428 0.668446
## Heure05:Jour_semainesamedi -28.1250 10.3600 -2.715 0.006677 **
## Heure06:Jour_semainesamedi -56.5000 10.3600 -5.454 5.41e-08 ***
## Heure07:Jour_semainesamedi -107.8125 10.3600 -10.407 < 2e-16 ***
## Heure08:Jour_semainesamedi -123.1250 10.3600 -11.885 < 2e-16 ***
## Heure09:Jour_semainesamedi -129.3125 10.3600 -12.482 < 2e-16 ***
## Heure10:Jour_semainesamedi -137.2500 10.3600 -13.248 < 2e-16 ***
## Heure11:Jour_semainesamedi -131.9375 10.3600 -12.735 < 2e-16 ***
## Heure12:Jour_semainesamedi -129.6875 10.3600 -12.518 < 2e-16 ***
## Heure13:Jour_semainesamedi -134.1250 10.3600 -12.946 < 2e-16 ***
## Heure14:Jour_semainesamedi -155.8125 10.3600 -15.040 < 2e-16 ***
## Heure15:Jour_semainesamedi -151.1875 10.3600 -14.593 < 2e-16 ***
## Heure16:Jour_semainesamedi -134.0333 10.4460 -12.831 < 2e-16 ***
## Heure17:Jour_semainesamedi -102.5625 10.3600 -9.900 < 2e-16 ***
## Heure18:Jour_semainesamedi -74.1250 10.3600 -7.155 1.09e-12 ***
## Heure19:Jour_semainesamedi -56.5625 10.3600 -5.460 5.23e-08 ***
## Heure20:Jour_semainesamedi -33.7500 10.3600 -3.258 0.001138 **
## Heure21:Jour_semainesamedi -26.5000 10.3600 -2.558 0.010587 *
## Heure22:Jour_semainesamedi -16.8750 10.7830 -1.565 0.117715
## Heure23:Jour_semainesamedi -13.8125 10.7830 -1.281 0.200327
## Heure01:Jour_semainedimanche -2.9744 11.0664 -0.269 0.788125
## Heure02:Jour_semainedimanche -4.6587 10.9414 -0.426 0.670301
## Heure03:Jour_semainedimanche -8.6997 11.2124 -0.776 0.437877
## Heure04:Jour_semainedimanche -12.4952 10.6546 -1.173 0.241006
## Heure05:Jour_semainedimanche -37.1738 12.1663 -3.055 0.002270 **
## Heure06:Jour_semainedimanche -64.1202 10.8330 -5.919 3.67e-09 ***
## Heure07:Jour_semainedimanche -120.7452 10.8330 -11.146 < 2e-16 ***
## Heure08:Jour_semainedimanche -132.4702 10.7382 -12.336 < 2e-16 ***
## Heure09:Jour_semainedimanche -137.0452 10.7382 -12.762 < 2e-16 ***
## Heure10:Jour_semainedimanche -141.4160 10.7382 -13.169 < 2e-16 ***
## Heure11:Jour_semainedimanche -131.1827 10.6546 -12.312 < 2e-16 ***
## Heure12:Jour_semainedimanche -131.1827 10.6546 -12.312 < 2e-16 ***
## Heure13:Jour_semainedimanche -131.6202 10.6546 -12.353 < 2e-16 ***
## Heure14:Jour_semainedimanche -147.2452 10.6546 -13.820 < 2e-16 ***
## Heure15:Jour_semainedimanche -147.0577 10.6546 -13.802 < 2e-16 ***
## Heure16:Jour_semainedimanche -129.0577 10.6546 -12.113 < 2e-16 ***
## Heure17:Jour_semainedimanche -91.6827 10.6546 -8.605 < 2e-16 ***
## Heure18:Jour_semainedimanche -66.1827 10.6546 -6.212 6.10e-10 ***
## Heure19:Jour_semainedimanche -46.4327 10.6546 -4.358 1.36e-05 ***
## Heure20:Jour_semainedimanche -23.0577 10.6546 -2.164 0.030549 *
## Heure21:Jour_semainedimanche -17.8077 10.6546 -1.671 0.094773 .
## Heure22:Jour_semainedimanche -9.3660 10.7382 -0.872 0.383175
## Heure23:Jour_semainedimanche -6.8077 10.6546 -0.639 0.522918
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.72 on 2560 degrees of freedom
## Multiple R-squared: 0.9546, Adjusted R-squared: 0.9516
## F-statistic: 320.2 on 168 and 2560 DF, p-value: < 2.2e-16
c(Previous = summary(simple_model_freq)$r.squared, New = summary(model_interact_freq)$r.squared)
## Previous New
## 0.8792895 0.9545746
## Residuals comparison
ggplot(data.table(Residuals = c(simple_model_freq$residuals, model_interact_freq$residuals),
Type = c(rep("MLR - simple", nrow(siwim_data_hours)),
rep("MLR with interactions", nrow(siwim_data_hours)))),
aes(Type, Residuals, fill = Type)) +
geom_boxplot()
ggplotly()
## Comparison graph of predicted vs real values
datas <- rbindlist(list(siwim_data_hours[, .(Count, time_step)],
data.table(Count = model_interact_freq$fitted.values,
time_step = siwim_data_hours[, time_step])))
datas[, type := rep(c("Real", "Fitted"), each = nrow(siwim_data_hours))]
ggplot(data = datas, aes(time_step, Count, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Frequency",
title = "Fit from MLR with interactions")
## Visualisation of the residuals
ggplot(data = data.table(Fitted_values = model_interact_freq$fitted.values,
Residuals = model_interact_freq$residuals),
aes(Fitted_values, Residuals)) +
geom_point(size = 1.7) +
# geom_smooth() +
geom_hline(yintercept = 0, color = "red", size = 1) +
labs(title = "Fitted values vs Residuals")
ggQQ(model_interact_freq)
## Lags
nb_lags <- 168
lags <- paste("lag", 1:nb_lags, sep = "_")
## Build the design matrix with the lagged counts
siwim_data_hours_X <- siwim_data_hours[, c("Count", "Heure", "Jour_semaine")]
siwim_data_hours_X[, (lags) := shift(siwim_data_hours$Count, 1:nb_lags)]
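As a small illustration of how `data.table::shift()` fills the lag columns, on a toy vector rather than the SIWIM series:

```r
library(data.table)
dt <- data.table(Count = c(10, 12, 15, 11, 9))
## shift(Count, 1:2) returns a list of the lag-1 and lag-2 vectors, NA-padded
dt[, paste("lag", 1:2, sep = "_") := shift(Count, 1:2)]
dt
##    Count lag_1 lag_2
## 1:    10    NA    NA
## 2:    12    10    NA
## 3:    15    12    10
## 4:    11    15    12
## 5:     9    11    15
```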
## Full linear model on Heure, Jour_semaine and the 168 lags,
## then stepwise selection (both directions) with the BIC penalty k = log(n)
lmfit <- lm(Count ~ ., data = na.omit(siwim_data_hours_X))
model_step <- step(lmfit, direction = "both", k = log(nrow(siwim_data_hours_X)))
Results of the selection:
## (Intercept) Heure01 Heure02
## 9.59421759 -1.09020865 0.54205982
## Heure03 Heure04 Heure05
## 2.64088171 1.63177828 11.86325680
## Heure06 Heure07 Heure08
## 26.84855041 43.32919136 9.67081567
## Heure09 Heure10 Heure11
## 11.92554829 17.92234708 10.30276396
## Heure12 Heure13 Heure14
## 12.39874845 19.02926018 28.53946273
## Heure15 Heure16 Heure17
## 21.89094857 11.08178061 2.39168501
## Heure18 Heure19 Heure20
## -3.62807002 -6.43401699 -7.42600959
## Heure21 Heure22 Heure23
## -5.58118969 -4.73008825 -2.14588119
## Jour_semainedimanche Jour_semainejeudi Jour_semainemardi
## -12.31223273 2.75519186 -2.16598716
## Jour_semainemercredi Jour_semainesamedi Jour_semainevendredi
## 3.02810112 -12.31277735 0.92405967
## lag_1 lag_2 lag_4
## 0.75114780 0.14287950 -0.05637018
## lag_9 lag_15 lag_23
## -0.10768881 0.05869562 0.05299710
## lag_27 lag_29 lag_45
## -0.08286029 0.05671318 -0.06393391
## lag_95 lag_100 lag_111
## -0.04298543 0.03068071 -0.02257531
## lag_143 lag_164 lag_168
## 0.02357880 0.03073459 -0.02753100
## lag_90
## 0.02578676
Time series are observations ordered chronologically. Unlike non-temporal data, their ordering therefore matters and must be respected during cross-validation. Two approaches exist:

- estimate the model on a sub-sample whose size grows incrementally, respecting the temporal order of the data; the forecast is then made on the values following this sample, for a given forecast horizon (expanding window);
- estimate the model on a group defined by a sliding window of fixed size; the forecast is then made on the values following this sample, for a given forecast horizon.
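A minimal sketch of the two schemes, using a simulated toy series and a one-lag linear forecast (the series, window sizes and horizon below are illustrative, not the SIWIM data):

```r
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.7), n = 80))  # toy AR(1) series
h         <- 1    # forecast horizon
train_min <- 40   # initial sample size (expanding window)
win       <- 40   # fixed size (sliding window)

err_exp <- err_sli <- numeric(0)
for (t in seq(train_min, length(y) - h)) {
  ## expanding window: all observations 1..t
  d_exp <- data.frame(y_t = y[2:t], y_lag = y[1:(t - 1)])
  ## sliding window: only the last `win` observations
  i     <- (t - win + 1):t
  d_sli <- data.frame(y_t = y[i[-1]], y_lag = y[i[-win]])
  newd  <- data.frame(y_lag = y[t])
  err_exp <- c(err_exp, y[t + h] - predict(lm(y_t ~ y_lag, d_exp), newd))
  err_sli <- c(err_sli, y[t + h] - predict(lm(y_t ~ y_lag, d_sli), newd))
}
c(RMSE_expanding = sqrt(mean(err_exp^2)),
  RMSE_sliding   = sqrt(mean(err_sli^2)))
```

In both schemes the model only ever sees observations anterior to the forecast target, which is the point of time-series cross-validation.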